Training dynamics of neural language models
Why do artificial neural networks model language so well? We claim that to answer this question, and to understand the biases that lead to such high-performing language models---and all models that handle language---we must analyze the training process. For decades, linguists have used the tools of developmental linguistics to study human bias towards linguistic structure. Similarly, we wish to consider a neural network's training dynamics: the empirical analysis of how training unfolds and the study of why our optimization methods work when applied in practice. This framing shows how structural patterns and linguistic properties are gradually built up over training, revealing more about why LSTM models learn so effectively from language data.
To explore these questions, we might be tempted to appropriate methods from developmental linguistics, but we do not wish to make cognitive claims, so we avoid analogizing between human and artificial language learners. We instead use mathematical tools designed for investigating language model training dynamics. These tools can take advantage of crucial differences between child development and model training: we have access to activations, weights, and gradients in a learning model, and can manipulate learning behavior directly or by perturbing inputs. While most research in training dynamics has focused on vision tasks, language offers direct annotation of its well-documented and intuitive latent hierarchical structures (e.g., syntax and semantics) and is therefore an ideal domain for exploring the effect of training dynamics on the representation of such structure.
Focusing on LSTM models, we investigate the natural sparsity of gradients and activations, finding that word representations are concentrated in just a few neurons late in training. Similarity analysis reveals how word embeddings learned for different tasks are highly similar at the beginning of training, but gradually become task-specific. Using synthetic data and measuring feature interactions, we also discover that hierarchical representations in LSTMs may be a result of their learning strategy: they tend to build new trees out of familiar phrases by entangling the meanings of constituents so that they depend on each other. These discoveries constitute just a few possible explanations for how LSTMs learn generalized language representations, with further theories on more architectures to be uncovered by the growing field of NLP training dynamics.
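To make the sparsity measurements concrete, here is a minimal sketch of one way such neuron-level concentration could be quantified; the top-k mass statistic, the choice of k, and the toy vectors are illustrative assumptions, not the thesis's exact protocol.

```python
import numpy as np

def topk_mass(vec: np.ndarray, k: int = 5) -> float:
    """Fraction of a vector's total L1 mass carried by its k
    largest-magnitude neurons (1.0 = fully concentrated)."""
    mags = np.abs(vec)
    return float(np.sort(mags)[-k:].sum() / mags.sum())

rng = np.random.default_rng(0)
diffuse = rng.normal(size=100)        # early training: mass spread out
focused = np.zeros(100)
focused[:3] = 10.0                    # late training: a few active neurons
print(topk_mass(diffuse), topk_mass(focused))  # small fraction vs. 1.0
```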
Understanding Learning Dynamics Of Language Models with SVCCA
Research has shown that neural models implicitly encode linguistic features,
but there has been no research showing how these encodings arise as the
models are trained. We present the first study on the learning dynamics of
neural language models, using a simple and flexible analysis method called
Singular Vector Canonical Correlation Analysis (SVCCA), which enables us to
compare learned representations across time and across models, without the need
to evaluate directly on annotated data. We probe the evolution of syntactic,
semantic, and topic representations and find that part-of-speech is learned
earlier than topic; that recurrent layers become more similar to those of a
tagger during training; and that embedding layers become less similar. Our
results and methods could inform better learning algorithms for NLP models,
perhaps by incorporating linguistic information more effectively.
Comment: Accepted for publication in NAACL 2019.
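For readers unfamiliar with the method, the following is a minimal numpy sketch of SVCCA as described above: each activation matrix is reduced by SVD to its high-variance directions, and the resulting subspaces are compared with CCA. The variance threshold and the random activations are placeholders.

```python
import numpy as np

def svcca(X: np.ndarray, Y: np.ndarray, var_kept: float = 0.99) -> float:
    """Mean SVCCA correlation between two activation matrices
    (datapoints x neurons), per Raghu et al.'s SVD-then-CCA recipe."""
    def reduce(M):
        M = M - M.mean(axis=0)
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        # Keep the top singular directions explaining `var_kept` variance.
        keep = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), var_kept) + 1
        return U[:, :keep] * s[:keep]

    Xr, Yr = reduce(X), reduce(Y)
    # CCA via orthonormal bases: canonical correlations are the
    # singular values of Qx^T Qy.
    Qx, _ = np.linalg.qr(Xr)
    Qy, _ = np.linalg.qr(Yr)
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(rho.mean())

# Compare, e.g., one layer's activations at two training checkpoints.
acts_early = np.random.randn(1000, 256)
acts_late = np.random.randn(1000, 256)
print(svcca(acts_early, acts_late))
```

Rows of the two matrices must correspond to the same datapoints; the columns (neurons) need not align, which is what lets the method compare representations across checkpoints or across models.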
One Venue, Two Conferences: The Separation of Chinese and American Citation Networks
At NeurIPS, American and Chinese institutions cite papers from each other's
regions substantially less than they cite endogamously. We build a citation
graph to quantify this divide, compare it to European connectivity, and discuss
the causes and consequences of the separation.
Comment: Workshop on Cultures of AI and AI for Culture @ NeurIPS 2022.
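As an illustration of the kind of statistic such a citation graph yields, the sketch below tabulates in-region citation rates from a toy edge list; the region tags and counts are invented for illustration, not the paper's data.

```python
from collections import Counter

# Toy edge list: one (citing_region, cited_region) pair per citation,
# with papers tagged by the region of their authors' institutions.
citations = [("US", "US"), ("US", "US"), ("US", "CN"),
             ("CN", "CN"), ("CN", "CN"), ("CN", "US")]

counts = Counter(citations)
for region in ("US", "CN"):
    total = sum(n for (src, _), n in counts.items() if src == region)
    in_region = counts[(region, region)]
    print(f"{region}: {in_region / total:.2f} of citations stay in-region")
```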
Pareto Probing: Trading Off Accuracy for Complexity
The question of how to probe contextual word representations for linguistic
structure in a way that is both principled and useful has seen significant
attention recently in the NLP literature. In our contribution to this
discussion, we argue for a probe metric that reflects the fundamental trade-off
between probe complexity and performance: the Pareto hypervolume. To measure
complexity, we present a number of parametric and non-parametric metrics. Our
experiments using Pareto hypervolume as an evaluation metric show that probes
often do not conform to our expectations---e.g., why should the non-contextual
fastText representations encode more morpho-syntactic information than the
contextual BERT representations? These results suggest that common, simplistic
probing tasks, such as part-of-speech labeling and dependency arc labeling, are
inadequate to evaluate the linguistic structure encoded in contextual word
representations. This leads us to propose full dependency parsing as a probing
task. In support of our suggestion that harder probing tasks are necessary, our
experiments with dependency parsing reveal a wide gap in syntactic knowledge
between contextual and non-contextual representations.
Comment: Tiago Pimentel and Naomi Saphra contributed equally to this work.
Camera-ready version of EMNLP 2020 publication. Code available at
https://github.com/rycolab/pareto-probing
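To make the evaluation metric concrete, here is a minimal sketch of the two-dimensional Pareto hypervolume for probes scored by (complexity, accuracy); the reference point and the assumption that both axes are scaled to [0, 1] are mine, not necessarily the paper's exact setup.

```python
def pareto_hypervolume(points, ref=(1.0, 0.0)):
    """Area dominated by a set of probes in (complexity, accuracy) space,
    minimizing complexity and maximizing accuracy, measured against the
    worst-case reference corner `ref`."""
    frontier, best_acc = [], ref[1]
    for comp, acc in sorted(points):      # increasing complexity
        if acc > best_acc:                # keep only Pareto-optimal probes
            frontier.append((comp, acc))
            best_acc = acc

    # Sum the rectangles between successive frontier points and the reference.
    area = 0.0
    for i, (comp, acc) in enumerate(frontier):
        next_comp = frontier[i + 1][0] if i + 1 < len(frontier) else ref[0]
        area += (next_comp - comp) * (acc - ref[1])
    return area

# Three hypothetical probes: cheap/weak, mid-range, expensive/strong.
print(pareto_hypervolume([(0.2, 0.6), (0.5, 0.8), (0.9, 0.85)]))  # 0.585
```

A larger hypervolume means the family of probes buys more accuracy per unit of complexity, which is the trade-off the metric is meant to capture.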
Sparsity Emerges Naturally in Neural Language Models
Concerns about interpretability, computational resources, and principled
inductive priors have motivated efforts to engineer sparse neural models for
NLP tasks. If sparsity is important for NLP, might well-trained neural models
naturally become roughly sparse? Using the Taxi-Euclidean norm to measure
sparsity, we find that frequent input words are associated with concentrated or
sparse activations, while frequent target words are associated with dispersed
activations but concentrated gradients. We find that gradients associated with
function words are more concentrated than the gradients of content words, even
controlling for word frequency.Comment: Published in the ICML 2019 Workshop on Identifying and Understanding
Deep Learning Phenomena: https://openreview.net/forum?id=H1ets1h56
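Reading the Taxi-Euclidean norm as the ratio of L1 to L2 norms, a standard concentration measure, a minimal sketch of the statistic looks like this; the toy vectors merely stand in for real activation or gradient vectors.

```python
import numpy as np

def taxi_euclidean(x: np.ndarray) -> float:
    """L1/L2 norm ratio: 1.0 for a one-hot vector, sqrt(len(x)) for a
    perfectly uniform one, so lower values mean more concentrated mass."""
    return float(np.linalg.norm(x, 1) / np.linalg.norm(x, 2))

concentrated = np.array([9.0, 0.1, 0.1, 0.1])  # e.g., a frequent input word
dispersed = np.ones(4)                          # e.g., a frequent target word
print(taxi_euclidean(concentrated))  # close to 1.0
print(taxi_euclidean(dispersed))     # sqrt(4) = 2.0
```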
Evaluating Informal-Domain Word Representations With UrbanDictionary
Existing corpora for intrinsic evaluation are not targeted towards tasks in
informal domains such as Twitter or news comment forums. We want to test
whether a representation of informal words fulfills the promise of eliding
explicit text normalization as a preprocessing step. One possible evaluation
metric for such domains is the proximity of spelling variants. We propose how
such a metric might be computed and how a spelling variant dataset can be
collected using UrbanDictionary.
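One plausible instantiation of the proposed metric, sketched below, scores an embedding space by the mean cosine similarity of spelling-variant pairs; the pairs and the random-vector lookup are placeholders for a real embedding table and an UrbanDictionary-derived dataset.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def variant_proximity(embed: dict, pairs) -> float:
    """Mean cosine similarity between spelling-variant pairs; higher means
    the space keeps informal variants near their standard forms."""
    return float(np.mean([cosine(embed[a], embed[b]) for a, b in pairs]))

# Placeholder variant pairs of the kind UrbanDictionary could supply.
pairs = [("tomorrow", "tmrw"), ("because", "bc"), ("what", "wut")]
rng = np.random.default_rng(0)
embed = {w: rng.normal(size=50) for pair in pairs for w in pair}
print(variant_proximity(embed, pairs))
```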